In [1]:
import gcp.bigquery as bq
Many different types of samples were obtained from the TCGA participants, and details about these samples are available in the Biospecimen data table. This next query shows how many samples exist of each type, as well as the full names and abbreviations of each type:
In [2]:
%%sql
SELECT
SampleType,
SampleTypeLetterCode,
COUNT(*) AS n
FROM
[isb-cgc:tcga_201607_beta.Biospecimen_data]
GROUP BY
SampleType,
SampleTypeLetterCode,
ORDER BY
n DESC
Out[2]:
Note that there are many types of tumor samples: primary, metastatic, recurrent, etc, although the vast majority are samples from primary tumors. In the TCGA project, almost all tumor samples were assayed on multiple platforms for mRNA and miRNA expression, DNA methylation, DNA copy-number, and either exome- or whole-genome DNA sequence. For some tumor samples, protein activity was also measured using RPPA arrays. When available, adjacent "normal" tissue samples were also assayed on a subset of these platforms. The "blood normal" samples were primarily used only as a reference source of germline DNA in order to call somatic mutations.
We can do a similar counting exercise of the sample types represented in one of the molecular data tables, using one of the mRNA expression data tables:
In [3]:
%%sql
SELECT
SampleTypeLetterCode,
COUNT(*) AS n
FROM (
SELECT
SampleBarcode,
SampleTypeLetterCode
FROM
[isb-cgc:tcga_201607_beta.mRNA_UNC_HiSeq_RSEM]
GROUP BY
SampleBarcode,
SampleTypeLetterCode )
GROUP BY
SampleTypeLetterCode
ORDER BY
n DESC
Out[3]:
In this example, let's assume that we would like to do a study that requires a primary tumor sample and a matched-normal (adjacent) tissue sample. In order to find out which patients provided which types of samples, we need to query the Biospecimen data table. This next query module uses two sub-queries, one to get all patients with TP samples and another to get all patients with NT samples. The final query joins these two and returns a single list of patients.
In [4]:
%%sql --module patients_with_matched_samples
DEFINE QUERY patients_tp
SELECT
ParticipantBarcode
FROM
[isb-cgc:tcga_201607_beta.Biospecimen_data]
WHERE
( SampleTypeLetterCode="TP" )
GROUP BY
ParticipantBarcode
ORDER BY
ParticipantBarcode
DEFINE QUERY patients_nt
SELECT
ParticipantBarcode
FROM
[isb-cgc:tcga_201607_beta.Biospecimen_data]
WHERE
( SampleTypeLetterCode="NT" )
GROUP BY
ParticipantBarcode
ORDER BY
ParticipantBarcode
DEFINE QUERY patients_both
SELECT
nt.ParticipantBarcode AS ParticipantBarcode
FROM ( $patients_nt ) AS nt
JOIN
( $patients_tp ) AS tp
ON
nt.ParticipantBarcode = tp.ParticipantBarcode
ORDER BY
ParticipantBarcode
In [5]:
bq.Query(patients_with_matched_samples).results().to_dataframe()
Out[5]:
It might be interesting to find out what the distribution of tumor types is for this list of patients with matched tumor-normal sample pairs. We can define a new SQL module that refers to the results of a previously defined query as long as we pass that reference in when we call bq.Query():
In [6]:
%%sql --module tumor_type_counts
# now we'll use this list to find what types of tumors these patients
# belong to:
SELECT
Study,
COUNT(*) AS n
FROM
[isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
ParticipantBarcode IN ($patients_both)
GROUP BY
Study
ORDER BY
n DESC
In [7]:
bq.Query(tumor_type_counts,
patients_nt=patients_with_matched_samples.patients_nt,
patients_tp=patients_with_matched_samples.patients_tp,
patients_both=patients_with_matched_samples.patients_both).results().to_dataframe()
Out[7]:
In [ ]: